Abstract. Community structure is one of the most important properties of
networks. Most community detection algorithms are unsuitable for large
networks because of their running time. In practice, many networks have
millions or even billions of nodes. At that scale, algorithms running in
$O(n^2 \log n)$ time or worse are not practical; what we need are linear or
nearly linear time algorithms. In response to this need, we propose a quick
method to evaluate community structure in networks and then put forward a
local community detection algorithm, based on random walks, that runs in
nearly linear time. Our community evaluation measure can give results that
differ from measures used before, e.g., the Newman Modularity. Our algorithm
is effective on small benchmark networks, with slightly lower accuracy than
more complex algorithms, but it has a large advantage in running time on
large networks, especially very large ones.

1 Introduction

Networks are important tools for studying real systems. Nodes in networks are
usually organized into relatively dense groups called communities or clusters,
and community structure has become one of the important research directions.
As computer and Internet technology develops, the networks we can obtain
become larger and larger. Take the LiveJournal online social network and the
U.S. patent dataset as examples. LiveJournal is a free online community with
almost 10 million members [?]. The U.S. patent dataset is maintained by the
National Bureau of Economic Research and includes about 3.9 million patents
(3,923,922) [13]. It is reasonable to believe that many other real networks
are larger still and will grow quickly in the future. There is therefore a
great need for fast community detection algorithms.

To study community structure in large networks, we need to evaluate whether a
given network has community structure and, if it does, how to find the
communities. So far, the most widely accepted measure of community structure
is modularity [5] [18]. However, modularity has an intrinsic scale, meaning
that modules smaller than that scale may not be resolved [10], and finding the
partition with maximal modularity is not trivial. Local methods, which work
independently of the global structure, seem more practical for such large
data [?] [3] [?].



One kind of local algorithm divides the whole network into two parts
[3] [3] [17]: a community C and the set B of nodes with links to C. Such
algorithms usually start from a given node s, then explore B and select one or
more nodes to merge into C. This operation is repeated until some termination
condition is satisfied. Another kind of local algorithm works differently.
First, it calculates a vector around a given node s; this vector holds scores
that indicate each node's tendency to belong to the community C we want to
find. Then the vector is sorted by score, which gives a new vector called a
support vector. Finally, a sweep over this support vector, based on some
quality function, finds the community. One such algorithm is [1], which
Leskovec has used to find many interesting phenomena in real networks
[15] [16].

As is well known, random walks are closely related to community structure. A
random walker starting from a given position will, with high probability, be
'trapped' in a community. Many community detection algorithms have been
inspired by this idea. Latapy and Pons proposed a vertex similarity measure
and a community similarity measure and then obtained communities by an
agglomerative procedure [19]. Also based on random walks, Zhou defined a
distance between pairs of nodes and used a divisive procedure to detect
communities, with running time $O(n^3)$ [55] [57] [55]. Community structure
can also be related to random walks through an information-theoretic approach,
in which community detection becomes the problem of compressing a description
of the probability flow of a random walk [20]. Delvenne introduced a quality
function that indicates the persistence of a clustering over time and unifies
the modularity measures [5] [18] as well as several definitions related to
random walks [7]. Other community detection algorithms based on random walks
include the Markov Cluster Algorithm (MCL), with running time $O(nk^2)$ [53],
methods using a signaling process [?], and methods that minimize a matrix
distance [24]. More detail about these methods can be found in [9]. All of
these algorithms scale as $O(n^2 \log n)$ or worse and are not practical for
very large networks with millions or billions of nodes. In this paper, we give
a quick measure to evaluate community structure and a local method, based on
random walks, that finds communities in nearly linear time and can be used on
very large networks.

The rest of the paper is arranged as follows. In Section 2 we propose a
modularity measure based on random walks. In Section 3 we give an algorithm
and test it on benchmarks. The results of experiments on very large networks
are given in Section 4. Finally, we give our conclusions in Section 5.

2 Random Walk Modularity 

As pointed out above, random walks are strongly indicative of network
structure. We consider the following situation: a random walk is terminated
when it forms a ring, and the corresponding number of steps is called the
random walk length (RWL for short). An interesting question is what the
expected RWL is for a given network. We examine the RWL in ER random networks
[?], where every pair of nodes is linked with probability p, and in the
planted l-partition model [11], which has been widely used to test the
performance of community detection algorithms.
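The definition of RWL can be illustrated with a minimal Python sketch. It assumes the graph is stored as a dictionary mapping each node to a list of neighbours, and that a "ring" means the walker steps onto any node it has already visited (including an immediate step back). The function names and the step-counting convention are assumptions for illustration and may differ by one from the counting used in the analysis.

```python
import random

def random_walk_length(adj, start, rng=random):
    """Walk from `start` until the walker steps onto an already visited
    node (a ring is formed); return the number of steps taken."""
    visited = {start}
    current = start
    steps = 0
    while adj[current]:              # an isolated start node cannot move
        current = rng.choice(adj[current])
        steps += 1
        if current in visited:       # ring closed, possibly by stepping back
            return steps
        visited.add(current)
    return steps

def average_rwl(adj, walks_per_node=10, seed=0):
    """Average RWL over several walks started from every non-isolated node."""
    rng = random.Random(seed)
    starts = [v for v in adj if adj[v]]
    total = sum(random_walk_length(adj, v, rng)
                for v in starts for _ in range(walks_per_node))
    return total / (len(starts) * walks_per_node)
```

Each walk touches only the nodes on its own trail, so the cost of one estimate is roughly the number of walks times the average walk length.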
Fig. 1. The relationship between random walk length and p in ER random networks

In ER random networks, where every pair of nodes is linked at random, a ring
is usually formed by two nodes (a backward step), since the probability that
more than two nodes form a ring is small when the network size tends to
infinity. At every step, the probability that a ring is formed by a backward
step is $1/d$, where, for the sake of discussion, we assume every node has the
same degree $d$. Let $L_r$ be the expected random walk length and $q_i$ the
probability that a random walk ends at step $i$. Then we have the following:

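$$L_r = \sum_{i=1}^{n} i\,q_i = \frac{1}{d} + 2\left(1-\frac{1}{d}\right)\frac{1}{d} + \dots + n\left(1-\frac{1}{d}\right)^{n-1}\frac{1}{d} = d - (d+n)\left(1-\frac{1}{d}\right)^{n} \qquad (1)$$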

So when $n \to \infty$, the expected RWL is determined mainly by d. When the
degree is not constant, the analysis becomes more complex, but there appears
to be a linear relationship between the average random walk length and p; see
Figure 1.
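As a quick numerical check of the derivation, the following Python sketch compares the term-by-term sum with the closed form $d - (d+n)(1-1/d)^n$, under the same assumption that every node has degree d and a walk ends at step i with probability $(1-1/d)^{i-1}/d$. The function names are hypothetical.

```python
def expected_rwl_sum(d, n):
    """Sum i * q_i for i = 1..n with q_i = (1 - 1/d)**(i - 1) * (1/d)."""
    p = 1.0 / d
    return sum(i * (1 - p) ** (i - 1) * p for i in range(1, n + 1))

def expected_rwl_closed(d, n):
    """Closed form of the same sum: d - (d + n) * (1 - 1/d)**n."""
    return d - (d + n) * (1 - 1.0 / d) ** n

if __name__ == "__main__":
    for d, n in [(4, 10), (16, 50), (16, 2000)]:
        print(d, n, expected_rwl_sum(d, n), expected_rwl_closed(d, n))
```

For large n the second term vanishes, so both expressions approach d, which matches the statement that the expected RWL is governed mainly by the degree.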

On the other hand, the average RWL (ARL) is inversely correlated with
community structure. We use the planted l-partition benchmark to illustrate
the relationship between random walk length and community structure. This
benchmark has been very popular for testing the performance of community
detection algorithms since it was proposed by Condon and Karp [6], and a
special case of the planted l-partition model was given by Newman [11]. In the
Newman model, 128 vertices are partitioned into 4 groups of 32 vertices each.
Every vertex has $z_{in}$ links to vertices in its own group and $z_{out}$
links to vertices outside it. The average total degree of the vertices is
fixed at 16, and p is the ratio between $z_{out}$ and the average degree of
each vertex, so the strength of the community structure can be controlled by
p. The average RWL is closely related to p; see Figure 2. This agrees with the
earlier idea that a
Fig. 2. Average number of steps a walker takes before encountering a ring in
the planted l-partition model [11]. In this model, every node has a fixed
average degree of 16, and p is the ratio between a node's out-degree and its
total degree



random walker is easily trapped in a community. The chance that a random walk
forms a ring therefore increases in networks with clear community structure.
As p gets larger, the community structure becomes fuzzier; as a result, the
random walker is more likely to escape from the 'trap', and the RWL increases.
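The Newman benchmark described above can be generated with a short Python sketch. It assumes the common probabilistic reading of the model: each within-group pair is linked with probability $z_{in}/31$ and each between-group pair with probability $z_{out}/96$, so that the expected degree is 16. The function name and the `ratio` parameter (the quantity called p in the text) are hypothetical.

```python
import random

def planted_partition(groups=4, group_size=32, avg_degree=16, ratio=0.2, seed=0):
    """Return (adjacency dict, membership list) for a planted partition graph.

    `ratio` is z_out divided by the average degree, i.e. the p of the text.
    """
    rng = random.Random(seed)
    n = groups * group_size
    z_out = ratio * avg_degree
    z_in = avg_degree - z_out
    p_in = z_in / (group_size - 1)      # 31 possible partners inside the group
    p_out = z_out / (n - group_size)    # 96 possible partners outside it
    membership = [v // group_size for v in range(n)]
    adj = {v: [] for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if membership[u] == membership[v] else p_out
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj, membership
```

With `ratio=0.0` the four groups are disconnected from each other; increasing `ratio` moves links out of the groups and blurs the community structure.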

From the above analysis, we know that RWL is mainly affected by the degree
sequence and the community structure. Inspired by this, we propose a simple
community evaluation measure called Random Walk Modularity, which can be
calculated in approximately linear time. The definition is as follows:



Q(G) 
Here $L(G)$ is the average random walk length in graph $G$, and $L(G_r)$ is
the average random walk length in graph $G_r$, the configuration model of $G$
with the same degree sequence as $G$. This measure removes the influence of
degree and reflects the community structure of different networks.
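The following Python sketch shows one way such a measure could be estimated. The defining formula for Q(G) is not legible in this text, so the sketch ASSUMES the normalized form $Q = 1 - L(G)/L(G_r)$, which is 0 when the two lengths agree and grows as walks in G get shorter. It also swaps in degree-preserving edge swaps as a stand-in for the configuration model. All function names are hypothetical, nodes are assumed to be integers, and the graph is assumed to have at least one edge.

```python
import random

def average_rwl(adj, walks_per_node, rng):
    """Average number of steps until a walk revisits a node."""
    total, count = 0, 0
    for start in adj:
        if not adj[start]:
            continue
        for _ in range(walks_per_node):
            visited, current, steps = {start}, start, 0
            while True:
                current = rng.choice(adj[current])
                steps += 1
                if current in visited:
                    break
                visited.add(current)
            total += steps
            count += 1
    return total / count

def rewire(adj, swaps_per_edge, rng):
    """Randomize a simple undirected graph while keeping every degree fixed."""
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    edge_set = set(edges)
    for _ in range(swaps_per_edge * len(edges)):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c
        if len({a, b, c, d}) < 4:
            continue                      # would create a self-loop
        e1, e2 = (min(a, d), max(a, d)), (min(c, b), max(c, b))
        if e1 in edge_set or e2 in edge_set:
            continue                      # would create a duplicate edge
        edge_set.discard(edges[i])
        edge_set.discard(edges[j])
        edges[i], edges[j] = e1, e2
        edge_set.update((e1, e2))
    new_adj = {v: [] for v in adj}
    for u, v in edges:
        new_adj[u].append(v)
        new_adj[v].append(u)
    return new_adj

def random_walk_modularity(adj, walks_per_node=20, swaps_per_edge=10, seed=0):
    rng = random.Random(seed)
    l_g = average_rwl(adj, walks_per_node, rng)
    l_r = average_rwl(rewire(adj, swaps_per_edge, rng), walks_per_node, rng)
    return 1.0 - l_g / l_r   # ASSUMED form; the source formula is not legible
```

Both averages are Monte Carlo estimates, so the result fluctuates with the seed and the number of walks.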

The Random Walk Modularity and the modularity given by Newman [18], which we
call Newman Modularity in this paper, behave differently on real networks.
First, random networks are usually thought to have no community structure, but
when p is small, the Newman Modularity and Conductance Modularity (see
Section 3) can be very large. In contrast, Random Walk Modularity is not
affected by p (see Figure 3), which agrees better with our intuition. Second,
some deterministic networks, e.g., the ring and the lattice, have a high
Newman Modularity value, but whether they have community structure is
disputable. In the lattice, every vertex occupies an equivalent position, so
its community structure, even if it has any, does not interest us. Random Walk
Modularity rules out all such networks by giving them a low value; see
Table 1. Furthermore, for networks with clear community structure, such as
planted l-partition networks, Random Walk Modularity always gives a high value
compared with networks with fuzzy community structure; see Figure 4. Finally,
the time to calculate Random Walk Modularity is mainly determined by the
average random walk length. Most real networks in our experiments have an
average random walk length of less than 10, and all of them less than 20; see
Section 4. Random Walk Modularity can therefore be calculated in nearly linear
time, which is an advantage for large networks.
Fig. 3. Different modularity measures in ER random networks, where every pair
of vertices is linked with probability p. c_m means Conductance Modularity,
n_m means Newman Modularity, and r_m means Random Walk Modularity

Fig. 4. Different modularity measures in the planted l-partition model, where
p is the ratio between z_out and the average degree. c_m means Conductance
Modularity, n_m means Newman Modularity, and r_m means Random Walk Modularity



Table 1. Different modularity measures in some deterministic networks. The
ring is a one-dimensional lattice with 1000 nodes. The tree has 1000 nodes,
and every vertex has the same number of children (2 in our experiments). The
lattice has two dimensions, with 100 nodes in each dimension

              ring     tree     lattice
RandomWalk    0.003    0.0004   0.08
Newman        0.94     0.93     0.89
Conductance   0.97     0.93     0.9



3 Algorithm 

In the last section, we proposed a quick measure to evaluate community
structure in large networks. A random walk started from a given node is easily
trapped in a community, so the average random walk length is inversely
correlated with community structure: the stronger the community structure, the
shorter the average random walk. Our community detection algorithm is based on
this idea. We perform a series of random walks from a given node $s_0$ and end
each walk when it forms a ring. We also assume that a node's tendency to
belong to the community of $s_0$ is positively correlated with the ring
position. Our algorithm is given below.

- RandomWalkRing(G, s_0, n)

1. Set the vector P = ∅

2. Perform a random walk from s_0, recording each node passed. When the walk
encounters a ring, end the walk and record its position l. For every v in the random
walk trail, let P(v) = P(v) + ^ 

3. Repeat step 2 n times

4. Order the nodes in P by decreasing value P(v) to get a support vector S

5. Compute the conductance $\phi(S_i)$ of the first i nodes, for $i \le |S|$

6. Find the index k* at every local optimum of $\phi(S_i)$; return $S_{k^*} = \{v_i \mid i \le k^*\}$
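Steps 1 to 4 of the procedure can be sketched in Python as follows. The per-walk increment in step 2 is not legible in this text, so the sketch ASSUMES that every node on a trail that closes at position l receives 1/l, which rewards nodes reached by short, quickly trapped walks. The function name and the dictionary-of-lists graph representation are also assumptions.

```python
import random
from collections import defaultdict

def random_walk_ring_scores(adj, s0, n_walks=1000, seed=0):
    """Return the support vector: visited nodes ordered by decreasing score."""
    rng = random.Random(seed)
    scores = defaultdict(float)                    # step 1: empty score vector
    for _ in range(n_walks):                       # step 3: repeat n times
        trail, visited, current = [s0], {s0}, s0
        while adj[current]:                        # step 2: walk until a ring
            current = rng.choice(adj[current])
            if current in visited:
                break
            visited.add(current)
            trail.append(current)
        ring_pos = len(trail)                      # position l where the ring closed
        for v in trail:
            scores[v] += 1.0 / ring_pos            # ASSUMED increment, see lead-in
    # step 4: order nodes by decreasing score
    return sorted(scores, key=scores.get, reverse=True)
```

For example, on two triangles joined by a single edge, walks started inside one triangle mostly close within it, so its three nodes lead the support vector.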

The above algorithm needs a quality function to extract the community from
the support vector. We use conductance as the quality function, as proposed in
[1] and used by Leskovec [25]. Conductance has become a popular measure of
community structure in recent years [13] [?] [?]. Let the volume vol(S) of a
set S be the total degree of the vertices in it, i.e.,
$\mathrm{vol}(S) = \sum_{v \in S} \deg(v)$. The conductance $\phi(S)$ of a set
S is defined as the ratio of the number of edges $e(S, \bar{S})$ leaving S to
the minimum of the volume of S and the volume of its complement $\bar{S}$,
i.e., $\phi(S) = e(S, \bar{S}) / \min\{\mathrm{vol}(S), \mathrm{vol}(\bar{S})\}$.
The conductance of the graph is the minimum conductance over all sets; it has
been studied extensively in computer science, with applications to random
walks, spectral or flow-based graph partitioning, and combinatorial object
constructions. Intuitively, a set of low conductance (smaller than some
constant $\phi_0$) can be thought of as a good community. Using this
definition, many interesting phenomena have been found. Leskovec finds most



Fig. 5. Performance of the Random Walk Ring algorithm and the Local Graph
Partitioning algorithm of Andersen



networks seem to have a 'core' that contains a constant fraction of the nodes,
with a periphery consisting of a large number of relatively small 'whiskers'
[15].
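The conductance sweep over an ordered list of nodes can be sketched in Python as below. It follows the definition of conductance given above (edges leaving the set divided by the smaller of the two volumes) and updates the cut and the volume incrementally as each node is added. The function names are hypothetical, the graph is assumed to be simple and undirected, and "local optimum" is read here as a local minimum of the conductance sequence.

```python
def sweep_conductance(adj, order):
    """Conductance of every prefix order[:k+1], for k = 0 .. len(order)-1."""
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    in_set, vol, cut = set(), 0, 0
    phis = []
    for v in order:
        inside = sum(1 for u in adj[v] if u in in_set)
        in_set.add(v)
        vol += len(adj[v])
        # edges to nodes already inside stop being cut edges; the rest become cut
        cut += len(adj[v]) - 2 * inside
        denom = min(vol, total_vol - vol)
        phis.append(cut / denom if denom > 0 else float("inf"))
    return phis

def local_optima(phis):
    """Indices k where the prefix conductance has a local minimum."""
    return [k for k in range(1, len(phis))
            if phis[k] < phis[k - 1]
            and (k + 1 == len(phis) or phis[k] <= phis[k + 1])]
```

A returned index k corresponds to the community `order[:k + 1]`; taking the first index gives the smallest locally optimal community.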

Let $C_1$ and $C_2$ be two communities. We use the following community
similarity measure to evaluate our algorithm, again on the planted l-partition
model.

$$S(C_1, C_2) = \frac{|C_1 \cap C_2|}{\sqrt{|C_1| \cdot |C_2|}} \qquad (3)$$

For each real community, we find the most similar community returned by our
algorithm; see Figure 5 for the performance of our algorithm and the algorithm
of [1]. Our algorithm has similar performance when p is small and slightly
lower accuracy when p is larger than 0.25. What we emphasize is that our
algorithm is very fast on large networks, as pointed out before, so some
sacrifice of accuracy is inevitable. Note that the accuracy of our algorithm
depends strongly on n, the number of walks: generally, the larger n is, the
more accurate the algorithm. There is thus a trade-off between accuracy and
speed. In our experiments we set n to 1000.
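The evaluation step can be sketched in Python as follows, assuming the similarity of two communities is the size of their overlap divided by the geometric mean of their sizes (the reading of equation (3) adopted here; the numerator is hard to make out in the source). The function names are hypothetical.

```python
from math import sqrt

def community_similarity(c1, c2):
    """Overlap of two node sets, normalized by the geometric mean of their sizes."""
    c1, c2 = set(c1), set(c2)
    if not c1 or not c2:
        return 0.0
    return len(c1 & c2) / sqrt(len(c1) * len(c2))

def best_match_scores(true_communities, found_communities):
    """For each real community, the similarity of its most similar found community."""
    return [max(community_similarity(t, f) for f in found_communities)
            for t in true_communities]
```

Averaging the values returned by `best_match_scores` gives a single accuracy figure per benchmark graph, which is one plausible way to produce a curve like the one in the performance figure.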

Conductance is a local definition and cannot give a global judgment of whether
a network has good community structure. We therefore give another modularity
measure, which we call Conductance Modularity to distinguish it from the
previous ones. Let c be a real number between 0 and 1, and let f be the
corresponding fraction of nodes in communities with conductance smaller than
c. Then Conductance Modularity is defined as follows.

$$C(G) = \max_{c \in [0,1]} (1 - c) \cdot f \qquad (4)$$

As we know, a smaller conductance usually indicates a better community. A
network with good community structure should have as many nodes as possible in
good communities. Conductance Modularity considers both the quality of the
communities and the number of nodes they contain, which gives an indication,
from the point of view of conductance, of whether a network has community
structure.
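A minimal Python sketch of this measure is given below. It ASSUMES the reading $C(G) = \max_c (1-c) \cdot f$, where f is the fraction of nodes whose community has conductance at most c; the source says "smaller than c", and the sketch uses "at most" so that the maximum is attained at one of the observed conductance values. The function name and the input format (a mapping from each node to the conductance of the community found for it) are hypothetical.

```python
def conductance_modularity(node_conductance):
    """Maximize (1 - c) * f(c), where f(c) is the fraction of nodes whose
    community conductance is at most c."""
    values = sorted(node_conductance.values())
    n = len(values)
    if n == 0:
        return 0.0
    # After sorting, `rank` nodes have conductance <= c at the rank-th value.
    # Ties are handled by the max: the largest rank for a value wins.
    return max((1.0 - c) * rank / n for rank, c in enumerate(values, start=1))
```

For instance, if two of three nodes sit in communities of conductance 0.1 and the third in one of conductance 0.9, the best trade-off is c = 0.1 with f = 2/3, which gives about 0.6.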



Figure 3 and Figure 4 compare the three modularity measures. All of them are
sensitive to community structure in the planted l-partition model. While
Conductance Modularity and Newman Modularity are strongly affected by p in ER
random networks, Random Walk Modularity is independent of p and seems better.

4 Application 

We run our algorithm on 34 networks, including some very large ones, in an
acceptable time. With our algorithm, we can find more than one community from
each node, indicating different levels of communities, just as Leskovec does
[25]. In Table 2, all results are calculated from the first locally optimal
community for the sake of discussion, because in most cases we care more about
the smallest group that includes us, which is always more compact and has more
influence on us, even though its conductance is not the global optimum.

Random Walk Modularity gives a different interpretation of community structure
from the other measures. Before the discussion, we give the following
classification of networks by their modularity. Random Walk Modularity in ER
random networks is 0 or very near 0. So networks with Random Walk Modularity
below 0.05 are considered to have no clear community structure, those between
0.05 and 0.1 weak community structure, and those above 0.1 clear community
structure. For Newman Modularity and Conductance Modularity, the thresholds
for clear community structure are set to 0.3 and 0.5, respectively.
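The classification rule can be written as a small Python helper. The thresholds 0.05 and 0.1 come from the text; how values exactly on a boundary are treated is not stated, so the choice below (0.05 and 0.1 both count as "weak") is an assumption, as is the function name.

```python
def classify_by_random_walk_modularity(q):
    """Map a Random Walk Modularity value to one of the three classes."""
    if q < 0.05:
        return "no clear community structure"
    if q <= 0.1:
        return "weak community structure"
    return "clear community structure"

if __name__ == "__main__":
    # Example values taken from the RM column reported for three networks.
    for name, q in [("roadnet_ca", 0.04), ("cit_patents", 0.08), ("web_google", 0.39)]:
        print(name, classify_by_random_walk_modularity(q))
```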

From the point of view of Newman Modularity and Conductance Modularity, all
networks have clear community structure, except the livejournal and wikivote
networks, whose Newman Modularity could not be calculated in an acceptable
time by the fast greedy algorithm [5]. The road, web, amazon, and some
collaboration, citation, and email networks have high values, while the other
networks have relatively small values. Newman Modularity and Conductance
Modularity are usually consistent with each other: when one measure gives a
high score, the other always gives a high score as well.

When we consider Random Walk Modularity, the situation is different. Networks
are divided into three classes, as discussed before. The road, p2p, wikivote,
and email_euall networks have no clear community structure, even though the
road networks have the highest Newman Modularity and Conductance Modularity.
The Citation_arnetminer and Citation_patents networks have weak community
structure. The other networks are considered to have clear community
structure. This difference can be explained by Figure 3 and Table 1. Those
networks with high Newman Modularity and Conductance Modularity but small
Random Walk Modularity

1 We only consider the corresponding undirected graphs for all the networks.
Except for the citation_arnetminer network [22] [21], which is from
http://arnetminer.org/citation, and the football network, which is from
http://www-personal.umich.edu/~mejn/netdata, all networks can be found at
http://snap.stanford.edu



Table 2. Statistics of the networks. RM means Random Walk Modularity, NM means
Newman Modularity found by the fast greedy algorithm [5], CM means Conductance
Modularity, AvgC means the average conductance of all communities, ARL means
average random walk length, and AvgS means the average size of all communities

Network          RM      NM     CM     AvgC    ARL     AvgS
amazon0302       0.25    0.82   0.79   0.18    5.86    19.95
amazon0312       0.31    0.8    0.73   0.24    8.77    26.29
amazon0505       0.31    0.76   0.73   0.24    8.86    26.42
amazon0601       0.32    0.74   0.73   0.24    8.95    26.96
cit_arnetminer   0.06    0.65   0.63   0.42    6.63    16.8
cit_hepph        0.25    0.56   0.55   0.48    18.5    28.7
cit_hepth        0.27    0.53   0.59   0.42    18.2    31.6
cit_patents      0.08    0.76   0.59   0.45    8.95    18.6
col_astroph      0.34    0.51   0.63   0.39    14      22.7
col_condmat      0.29    0.64   0.74   0.32    6.75    18.3
col_grqc         0.31    0.79   0.83   0.33            17.7
col_hepph        0.42    0.58   0.71   0.33    11      19.9
col_hepth        0.19    0.69   0.75   0.33    5.4     16.7
email_enron      0.18    0.5    0.57   0.47    9.34    49.1
email_euall      0.01    0.73   0.66   0.46    3.98    409
football         0.17    0.57   0.88   0.14    7.2     20.9
livejournal      0.19    -1     0.53   0.48    15.2    19.3
p2p4             0.007   0.38   0.53   0.6     8.2     11.9
p2p5             0.006   0.4    0.54   0.59    8.1     12.6
p2p6             0.006   0.39   0.54   0.59    8.1     12.5
p2p8             0.015   0.46   0.58   0.54    7.4     12.7
p2p9             0.014   0.46   0.58   0.54    7.3     12.5
p2p24            0.002   0.47   0.62   0.48    5.9     11.1
p2p25            0.005   0.49   0.63   0.47    5.8     11.5
p2p30            0.005   0.5    0.62   0.46    5.8     11.3
p2p31            0.003   0.5    0.63   0.46    5.7     11
roadnet_ca       0.04    0.99   0.93   0.087   3.67    26.7
roadnet_pa       0.04    0.99   0.93   0.087   3.68    26.9
roadnet_tx       0.04    0.99   0.93   0.1     3.63    26.2
wikivote         0.002   -1     0.58   0.5     4.89    496
web_berkstan     0.54    0.91   0.65   0.32    9       44.4
web_google       0.39    0.92   0.79   0.17    6.68    30.3
web_notredame    0.35    0.93   0.76   0.16    5       88.2
web_stanford     0.47    0.88   0.65   0.35    7.9     40.8


may be networks like lattices, or networks with very small average degree,
whose community structure is debatable. Overall, Random Walk Modularity is a
stricter evaluation of community structure.

In Table 2 we also give some other properties. On the whole, most networks
tend to have a small RWL, small conductance, and small community size. The
small community size may be influenced by our selection of the first optimal
conductance. The short ARL is clear: even for the largest networks, the
LiveJournal online social network and the U.S. patent dataset, a walk needs
only about 15 and 9 steps, respectively, to form a ring. The RWL seems to be
bounded above by the average degree, as analyzed before. Owing to the
influence of community structure, the real RWL is always smaller than that
value. The results show that Random Walk Modularity is independent of
conductance, RWL, and community size. If a network has a high Random Walk
Modularity value, we have more reason to believe it has community structure.



5 Conclusion 

In this paper, we proposed a method to evaluate community structure and a
local community detection algorithm based on random walks with approximately
linear running time. Our experiments show that the average random walk length
is affected by two factors: the average degree of the graph and the community
structure. Average random walk lengths are very short in real networks, which
is caused by the networks' sparseness, by community structure, or by both.
Such short average random walks guarantee that Random Walk Modularity can be
calculated in nearly linear time. We also gave a modularity measure from the
conductance point of view, which gives a profile of a large network.
Conductance Modularity and Newman Modularity are usually consistent in our
experiments, while Random Walk Modularity can give a different judgment.
Random Walk Modularity has advantages both in evaluating community structure
and in speed. Networks with high Random Walk Modularity are more credibly
regarded as having good community structure, whereas Newman Modularity and
Conductance Modularity can also give some ER random networks a high value.
Random Walk Modularity should therefore be used when we care about a network's
community structure and want to exclude debatable communities.

The running time of the random ring algorithm is mainly influenced by the
random walk length and the number of random walks. The former is influenced by
the average degree and the community structure and is usually small, less than
20 for all networks in our experiments. The latter, n, can be set by the user,
taking both accuracy and speed into account. Our results show that, with a
small sacrifice of accuracy, we can improve the algorithm's speed
considerably. The random ring algorithm can be used on very large networks
with millions or billions of nodes.

In the future, we will study the evolution of community structure and try to
explain why networks form different structures. The methods proposed in this
paper could help considerably in revealing the structure of large networks.